feat(db) resource version cas#1292
Conversation
30e8540 to
46c93c3
Compare
Adds Compare-And-Swap (CAS) based optimistic concurrency control to prevent lost updates in concurrent modification scenarios. This implements client-driven CAS for update operations and proper atomic create protection. Database Changes: - Add resource_version field to ObjectMeta proto message - Add resource_version column to objects table (migration 005) - Implement update_message_cas for single-attempt versioned updates - Add WriteCondition::MatchResourceVersion for CAS enforcement - Add PersistenceError::Conflict for version mismatch detection Protected Operations: - AttachSandboxProvider: uses client-driven CAS with expected_resource_version parameter - DetachSandboxProvider: uses client-driven CAS with expected_resource_version parameter - UpdateProvider: extracts resource_version from Provider.metadata, validates against current version - UpdateConfig: uses client-driven CAS for policy backfill path - CreateProvider: uses WriteCondition::MustCreate for atomic creation - SSH session operations: proper CAS protection CAS Modes: - Client-driven (expected_version > 0): Client fetches resource, uses its version for update. Conflict returns ABORTED status. - Server-driven (expected_version = 0): Server uses current DB version. Used for internal operations. CLI Changes: - Update attach/detach operations to fetch sandbox first and use its resource_version for CAS protection - Add clear error messages for ABORTED status on CAS conflicts - Add expected_resource_version: 0 to all UpdateConfig requests Testing: - 12 integration tests for concurrent modification scenarios - Tests verify ABORTED status on version conflicts - Coverage for all protected operations Documentation: - Update architecture/gateway.md with CAS design and semantics - Document expected_version parameter modes - List all client-driven CAS operations Signed-off-by: Derek Carr <decarr@redhat.com>
46c93c3 to
4931e83
Compare
|
@johntmyers ptal. updated per prior feedback, also updated all proto paths that required CAS enablement. the recently merged draft chunk work is not covered right now, i would like to leave that as a follow-on. |
There was a problem hiding this comment.
List responses still decode the stored protobuf payload directly, but the row resource_version is authoritative and the payload often contains 0 or the previous version. That means clients can receive stale/zero versions from list APIs and then get avoidable CAS failures or skip client-driven CAS with an unusable version.
Please hydrate metadata.resource_version = record.resource_version after decoding, or add a typed list_messages<T>() helper that mirrors get_message*. I see this pattern in ListProviders, ListSandboxes, and ListServices.
There was a problem hiding this comment.
This startup-resume error path still bypasses the new CAS/resource-version discipline. resume_persisted_sandboxes decodes sandboxes from list payloads, then mark_sandbox_error writes the mutated sandbox back with put_message, which does not use the DB row version and does not advance resource_version. A real phase/status change can be persisted without clients detecting it via CAS.
Please update this path with update_message_cas or put_if(..., MatchResourceVersion(record.resource_version)) so the Error transition increments the row version like the other sandbox mutations.
| sandbox.object_name() | ||
| ))); | ||
| // Generate UUID for database row and update metadata.id to match | ||
| let sandbox_id = uuid::Uuid::new_v4().to_string(); |
There was a problem hiding this comment.
handle_create_sandbox already generates and validates a sandbox ID before calling into ComputeRuntime, but this method replaces it with a second UUID. That means validation, logging, and any caller-side correlation around the first ID refer to an object ID that is never persisted or sent to the driver.
Please keep the server-constructed ID from the incoming Sandbox and use MustCreate with that ID, rather than regenerating it here.
Summary
Add Compare-And-Swap (CAS) infrastructure for safe concurrent object mutations
and migrate critical paths to use it. This prevents lost updates in HA
deployments with multiple gateway replicas.
Core infrastructure:
Migrations:
Database migrations backfill existing rows with resource_version = 1.
CAS updates increment atomically: resource_version = resource_version + 1.
gRPC handlers map PersistenceError::Conflict to ABORTED status code
to signal clients to retry with fresh data. Server-side retries use
bounded retry (5 attempts) with fresh reads on each iteration.
Test coverage includes concurrent update scenarios and handler-level
resource_version round-trip tests.
Related Issue
Fixes #1255
Changes
Testing
mise run pre-commitpassesChecklist